Setup
In [8]:
import pandas as pd
import numpy as np
import sweetviz as sv
In [9]:
data = pd.concat([pd.read_csv("../data/clean/train.csv"),
pd.read_csv("../data/clean/test.csv")]).reset_index(drop=True)
Data profiling
In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72739 entries, 0 to 72738
Data columns (total 43 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | host_response_rate | 65965 non-null | float64 |
| 1 | host_acceptance_rate | 66597 non-null | float64 |
| 2 | latitude | 72739 non-null | float64 |
| 3 | longitude | 72739 non-null | float64 |
| 4 | accommodates | 72739 non-null | int64 |
| 5 | bathrooms | 72718 non-null | float64 |
| 6 | bedrooms | 72567 non-null | float64 |
| 7 | beds | 72515 non-null | float64 |
| 8 | price | 72739 non-null | float64 |
| 9 | minimum_nights | 72739 non-null | int64 |
| 10 | maximum_nights | 72739 non-null | int64 |
| 11 | availability_30 | 72739 non-null | int64 |
| 12 | availability_60 | 72739 non-null | int64 |
| 13 | availability_90 | 72739 non-null | int64 |
| 14 | availability_365 | 72739 non-null | int64 |
| 15 | number_of_reviews | 72739 non-null | int64 |
| 16 | number_of_reviews_ltm | 72739 non-null | int64 |
| 17 | number_of_reviews_l30d | 72739 non-null | int64 |
| 18 | review_scores_rating | 57901 non-null | float64 |
| 19 | review_scores_accuracy | 57856 non-null | float64 |
| 20 | review_scores_cleanliness | 57855 non-null | float64 |
| 21 | review_scores_checkin | 57853 non-null | float64 |
| 22 | review_scores_communication | 57857 non-null | float64 |
| 23 | review_scores_location | 57851 non-null | float64 |
| 24 | review_scores_value | 57852 non-null | float64 |
| 25 | calculated_host_listings_count | 72739 non-null | int64 |
| 26 | calculated_host_listings_count_entire_homes | 72739 non-null | int64 |
| 27 | calculated_host_listings_count_private_rooms | 72739 non-null | int64 |
| 28 | calculated_host_listings_count_shared_rooms | 72739 non-null | int64 |
| 29 | reviews_per_month | 57901 non-null | float64 |
| 30 | host_is_superhost_flag | 72739 non-null | int64 |
| 31 | host_has_profile_pic_flag | 72739 non-null | int64 |
| 32 | host_identity_verified_flag | 72739 non-null | int64 |
| 33 | has_availability_flag | 72739 non-null | int64 |
| 34 | instant_bookable_flag | 72739 non-null | int64 |
| 35 | host_email_verified_flag | 72739 non-null | int64 |
| 36 | host_phone_verified_flag | 72739 non-null | int64 |
| 37 | host_work_email_verified_flag | 72739 non-null | int64 |
| 38 | host_response_time | 65965 non-null | object |
| 39 | property_type | 72739 non-null | object |
| 40 | room_type | 72739 non-null | object |
| 41 | state | 72739 non-null | object |
| 42 | city | 72739 non-null | object |
dtypes: float64(16), int64(22), object(5)
memory usage: 23.9+ MB
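The non-null counts above imply that the review-score columns and reviews_per_month are missing for roughly 20% of listings, and the host response and acceptance rates for somewhat under 10%. A minimal sketch of how those shares could be tabulated directly is shown here. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook; in practice the function would be called on `data`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({
    "price": [100.0, 250.0, 80.0, 120.0],
    "review_scores_rating": [4.8, np.nan, np.nan, 5.0],
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the real frame should list the same columns that show fewer than 72739 non-null entries in the `info()` output.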
In [11]:
# Numerical features
data.describe(include=['int64','float64'],
percentiles=[0.01]+np.arange(0.1,1,0.1).tolist()+[0.99]).T
Out[11]:
| | count | mean | std | min | 1% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| host_response_rate | 65965.0 | 95.893019 | 15.489104 | 0.000000 | 0.000000 | 92.000000 | 99.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.00000 |
| host_acceptance_rate | 66597.0 | 88.525504 | 21.428897 | 0.000000 | 0.000000 | 64.000000 | 83.000000 | 92.000000 | 96.000000 | 98.000000 | 99.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.00000 |
| latitude | 72739.0 | 34.002513 | 2.997901 | 30.097450 | 30.196625 | 30.286110 | 32.734058 | 32.796360 | 33.771102 | 33.993103 | 34.047783 | 34.086027 | 34.139844 | 41.764362 | 41.967421 | 42.02220 |
| longitude | 72739.0 | -110.288256 | 11.438353 | -118.917134 | -118.640056 | -118.444862 | -118.373404 | -118.314606 | -118.189719 | -117.892080 | -117.168598 | -97.784765 | -97.704548 | -87.752817 | -87.616778 | -87.52842 |
| accommodates | 72739.0 | 4.575757 | 3.185178 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 4.000000 | 4.000000 | 6.000000 | 6.000000 | 8.000000 | 16.000000 | 16.00000 |
| bathrooms | 72718.0 | 1.635276 | 1.109639 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 2.000000 | 2.000000 | 3.000000 | 6.000000 | 50.00000 |
| bedrooms | 72567.0 | 1.862568 | 1.377400 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 4.000000 | 6.000000 | 50.00000 |
| beds | 72515.0 | 2.439716 | 2.165706 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 4.000000 | 5.000000 | 10.000000 | 132.00000 |
| price | 72739.0 | 283.583016 | 670.223217 | 5.000000 | 32.000000 | 63.000000 | 87.000000 | 110.000000 | 132.000000 | 159.000000 | 194.000000 | 240.000000 | 319.000000 | 500.000000 | 2103.720000 | 56425.00000 |
| minimum_nights | 72739.0 | 13.177745 | 21.906856 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 4.000000 | 30.000000 | 30.000000 | 31.000000 | 60.000000 | 1000.00000 |
| maximum_nights | 72739.0 | 454.880229 | 404.497425 | 1.000000 | 7.000000 | 28.000000 | 60.000000 | 180.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 1125.000000 | 1125.000000 | 1125.000000 | 3650.00000 |
| availability_30 | 72739.0 | 15.386931 | 11.162401 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 6.000000 | 11.000000 | 15.000000 | 20.000000 | 25.000000 | 29.000000 | 30.000000 | 30.000000 | 30.00000 |
| availability_60 | 72739.0 | 35.211702 | 20.586669 | 0.000000 | 0.000000 | 2.000000 | 13.000000 | 23.000000 | 31.000000 | 37.000000 | 45.000000 | 53.000000 | 58.000000 | 60.000000 | 60.000000 | 60.00000 |
| availability_90 | 72739.0 | 58.210712 | 28.757990 | 0.000000 | 0.000000 | 9.800000 | 31.000000 | 45.000000 | 56.000000 | 64.000000 | 73.000000 | 82.000000 | 88.000000 | 90.000000 | 90.000000 | 90.00000 |
| availability_365 | 72739.0 | 222.487386 | 113.363258 | 0.000000 | 4.000000 | 61.000000 | 91.000000 | 148.000000 | 180.000000 | 244.000000 | 270.000000 | 318.000000 | 345.000000 | 363.000000 | 365.000000 | 365.00000 |
| number_of_reviews | 72739.0 | 47.677656 | 92.212599 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 5.000000 | 11.000000 | 22.000000 | 39.000000 | 70.000000 | 137.000000 | 439.000000 | 3689.00000 |
| number_of_reviews_ltm | 72739.0 | 12.210217 | 19.235970 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 3.000000 | 7.000000 | 14.000000 | 23.000000 | 37.000000 | 79.000000 | 666.00000 |
| number_of_reviews_l30d | 72739.0 | 1.043498 | 1.791507 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 7.000000 | 60.00000 |
| review_scores_rating | 57901.0 | 4.793315 | 0.374320 | 1.000000 | 3.000000 | 4.500000 | 4.690000 | 4.790000 | 4.850000 | 4.900000 | 4.940000 | 4.980000 | 5.000000 | 5.000000 | 5.000000 | 5.00000 |
| review_scores_accuracy | 57856.0 | 4.816045 | 0.354837 | 1.000000 | 3.000000 | 4.560000 | 4.740000 | 4.820000 | 4.880000 | 4.920000 | 4.950000 | 4.990000 | 5.000000 | 5.000000 | 5.000000 | 5.00000 |
| review_scores_cleanliness | 57855.0 | 4.772888 | 0.382629 | 1.000000 | 3.000000 | 4.470000 | 4.670000 | 4.760000 | 4.830000 | 4.880000 | 4.930000 | 4.970000 | 5.000000 | 5.000000 | 5.000000 | 5.00000 |
| review_scores_checkin | 57853.0 | 4.865768 | 0.318956 | 1.000000 | 3.500000 | 4.670000 | 4.820000 | 4.890000 | 4.930000 | 4.960000 | 4.980000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.00000 |
| review_scores_communication | 57857.0 | 4.869514 | 0.326593 | 1.000000 | 3.500000 | 4.680000 | 4.830000 | 4.900000 | 4.940000 | 4.970000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.00000 |
| review_scores_location | 57851.0 | 4.797704 | 0.340652 | 1.000000 | 3.330000 | 4.500000 | 4.690000 | 4.790000 | 4.850000 | 4.900000 | 4.940000 | 4.980000 | 5.000000 | 5.000000 | 5.000000 | 5.00000 |
| review_scores_value | 57852.0 | 4.714665 | 0.405423 | 0.000000 | 3.000000 | 4.400000 | 4.600000 | 4.690000 | 4.760000 | 4.810000 | 4.860000 | 4.900000 | 4.970000 | 5.000000 | 5.000000 | 5.00000 |
| calculated_host_listings_count | 72739.0 | 20.362172 | 67.831619 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 5.000000 | 9.000000 | 17.000000 | 44.000000 | 549.000000 | 569.00000 |
| calculated_host_listings_count_entire_homes | 72739.0 | 18.411609 | 67.650574 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 6.000000 | 13.000000 | 38.000000 | 549.000000 | 569.00000 |
| calculated_host_listings_count_private_rooms | 72739.0 | 1.412076 | 5.581708 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 | 27.000000 | 89.00000 |
| calculated_host_listings_count_shared_rooms | 72739.0 | 0.217751 | 2.588695 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 | 61.00000 |
| reviews_per_month | 57901.0 | 1.783405 | 1.869215 | 0.010000 | 0.030000 | 0.140000 | 0.290000 | 0.510000 | 0.850000 | 1.230000 | 1.760000 | 2.330000 | 3.050000 | 4.120000 | 7.570000 | 56.46000 |
| host_is_superhost_flag | 72739.0 | 0.444562 | 0.496921 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| host_has_profile_pic_flag | 72739.0 | 0.972848 | 0.162527 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| host_identity_verified_flag | 72739.0 | 0.906735 | 0.290805 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| has_availability_flag | 72739.0 | 0.990761 | 0.095673 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| instant_bookable_flag | 72739.0 | 0.324668 | 0.468254 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| host_email_verified_flag | 72739.0 | 0.912867 | 0.282032 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| host_phone_verified_flag | 72739.0 | 0.999381 | 0.024865 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 |
| host_work_email_verified_flag | 72739.0 | 0.145232 | 0.352337 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.00000 |
In [12]:
# Categorical features
data.describe(include=['object']).T
Out[12]:
| | count | unique | top | freq |
|---|---|---|---|---|
| host_response_time | 65965 | 4 | within an hour | 53510 |
| property_type | 72739 | 33 | entire home | 21773 |
| room_type | 72739 | 4 | entire home/apt | 58660 |
| state | 72739 | 3 | california | 48950 |
| city | 72739 | 5 | los angeles | 37296 |
Descriptive analysis
The Sweetviz reports below are built on a 10% random sample of the data. They summarize the distribution of each feature and show its relationship with the target variable, price. The detailed summary of the target also ranks the numerical and categorical features by the strength of their association with it.
In [13]:
report = sv.analyze(data.sample(frac=0.1, random_state=123), target_feat='price')
report.show_notebook()
Done! Use 'show' commands to display/save. |██████████| [100%] 00:03 -> (00:00 left)
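As a lightweight cross-check on the report's association rankings, the numerical features can also be ranked against the target with plain pandas. The sketch below uses Spearman rank correlation, which is swapped in here because it is robust to a skewed target; it is not necessarily the measure Sweetviz reports. The function name `rank_numeric_associations` and the synthetic toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def rank_numeric_associations(df: pd.DataFrame, target: str) -> pd.Series:
    """Spearman correlation of each numeric column with `target`, sorted by magnitude."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="spearman")[target].drop(target)
    # Sort by absolute value so strong negative associations rank high too
    return corr.sort_values(key=np.abs, ascending=False)


# Synthetic stand-in for `data`: price grows with capacity, `noise` is unrelated
rng = np.random.default_rng(123)
n = 500
accommodates = rng.integers(1, 9, size=n)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "noise": rng.normal(size=n),
    "price": 40.0 * accommodates * rng.lognormal(0.0, 0.3, size=n),
})
ranking = rank_numeric_associations(toy, "price")
print(ranking)
```

On the real data this would be called as `rank_numeric_associations(data, "price")`. Categorical features such as room_type are ignored by this sketch and would need a different measure.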
Since price is heavily right-skewed (a mean of about 284 against a median of 159 and a maximum of 56,425), let's apply a log10 transformation and re-check its relationship with the predictor variables.
In [14]:
# Transform price using log10
data_log_price = data.copy()
data_log_price['log_price'] = np.log10(data_log_price['price'])
data_log_price.pop('price')
report = sv.analyze(data_log_price.sample(frac=0.1, random_state=123), target_feat='log_price')
report.show_notebook()
Done! Use 'show' commands to display/save. |██████████| [100%] 00:02 -> (00:00 left)
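The motivation for the log transform can be checked numerically by comparing the skewness of the target before and after it. The sketch below does this on a synthetic log-normal sample, which is only an assumed stand-in for the price column; the distribution parameters are made up, and on the real data the same two calls would be applied to `data['price']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Synthetic, log-normally distributed stand-in for the price column
price = pd.Series(rng.lognormal(mean=5.0, sigma=0.9, size=5_000), name="price")
log_price = np.log10(price)

# Sample skewness: near 0 for a symmetric distribution, large and positive
# for a long right tail
skew_raw = price.skew()
skew_log = log_price.skew()

print(f"skewness of price:        {skew_raw:.2f}")
print(f"skewness of log10(price): {skew_log:.2f}")
```

A skewness close to zero after the transform suggests that linear relationships with the predictors are easier to read off log_price than off the raw price.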